{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Otto商品分类——LightGBM\n",
    "原始特征+tfidf特征"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们以Kaggle 2015年举办的Otto Group Product Classification Challenge竞赛数据为例，对LightGBM的超参数进行调优。\n",
    "\n",
    "Otto数据集是著名电商Otto提供的一个多类商品分类问题，类别数=9. 每个样本有93维数值型特征（整数，表示某种事件发生的次数，已经进行过脱敏处理）。 竞赛官网：https://www.kaggle.com/c/otto-group-product-classification-challenge/data\n",
    "\n",
    "\n",
    "第一名：https://www.kaggle.com/c/otto-group-product-classification-challenge/discussion/14335\n",
    "第二名：http://blog.kaggle.com/2015/06/09/otto-product-classification-winners-interview-2nd-place-alexander-guschin/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 首先 import 必要的模块\n",
    "import pandas as pd \n",
    "import numpy as np\n",
    "\n",
    "import lightgbm as lgbm\n",
    "from lightgbm.sklearn import LGBMClassifier\n",
    "\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 读取数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# 读取数据\n",
    "# 这里使用原始特征+tf_idf特征，log(x+1)特征为原始特的单调变换，加上log特征对决策树模型影响不大\n",
    "# path to where the data lies\n",
    "dpath = './data/'\n",
    "\n",
    "train1 = pd.read_csv(dpath +\"Otto_FE_train_org.csv\")\n",
    "train2 = pd.read_csv(dpath +\"Otto_FE_train_tfidf.csv\")\n",
    "\n",
    "#去掉多余的id\n",
    "train2 = train2.drop([\"id\",\"target\"], axis=1)\n",
    "train =  pd.concat([train1, train2], axis = 1, ignore_index=False)\n",
    "train.head()\n",
    "\n",
    "del train1\n",
    "del train2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 准备数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将类别字符串变成数字，LightGBM不支持字符串格式的特征输入/标签输入\n",
    "y_train = train['target'] #形式为Class_x\n",
    "y_train = y_train.map(lambda s: s[6:])\n",
    "y_train = y_train.map(lambda s: int(s) - 1)#将类别的形式由Class_x变为0-8之间的整数\n",
    "\n",
    "X_train = train.drop([\"id\", \"target\"], axis=1)\n",
    "\n",
    "#保存特征名字以备后用（可视化）\n",
    "feat_names = X_train.columns \n",
    "\n",
    "#sklearn的学习器大多之一稀疏数据输入，模型训练会快很多\n",
    "#查看一个学习器是否支持稀疏数据，可以看fit函数是否支持: X: {array-like, sparse matrix}.\n",
    "#可自行用timeit比较稠密数据和稀疏数据的训练时间\n",
    "from scipy.sparse import csr_matrix\n",
    "X_train = csr_matrix(X_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LightGBM超参数调优"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LightGBM的主要的超参包括：\n",
    "1. 树的数目n_estimators 和 学习率 learning_rate\n",
    "2. 树的最大深度max_depth 和 树的最大叶子节点数目num_leaves（注意：XGBoost只有max_depth，LightGBM采用叶子优先的方式生成树，num_leaves很重要，设置成比 2^max_depth 小）\n",
    "3. 叶子结点的最小样本数:min_data_in_leaf(min_data, min_child_samples)\n",
    "4. 每棵树的列采样比例：feature_fraction/colsample_bytree\n",
    "5. 每棵树的行采样比例：bagging_fraction （需同时设置bagging_freq=1）/subsample\n",
    "6. 正则化参数lambda_l1(reg_alpha), lambda_l2(reg_lambda)\n",
    "\n",
    "7. 两个非模型复杂度参数，但会影响模型速度和精度。可根据特征取值范围和样本数目修改这两个参数\n",
    "1）特征的最大bin数目max_bin：默认255；\n",
    "2）用来建立直方图的样本数目subsample_for_bin：默认200000。\n",
    "\n",
    "对n_estimators，用LightGBM内嵌的cv函数调优，因为同XGBoost一样，LightGBM学习的过程内嵌了cv，速度极快。\n",
    "其他参数用GridSearchCV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "MAX_ROUNDS = 10000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 相同的交叉验证分组"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# prepare cross validation\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "\n",
    "kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. n_estimators"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "#直接调用lightgbm内嵌的交叉验证(cv)，可对连续的n_estimators参数进行快速交叉验证\n",
    "#而GridSearchCV只能对有限个参数进行交叉验证，且速度相对较慢\n",
    "def get_n_estimators(params , X_train , y_train , early_stopping_rounds=10):\n",
    "    lgbm_params = params.copy()\n",
    "    lgbm_params['num_class'] = 9\n",
    "     \n",
    "    lgbmtrain = lgbm.Dataset(X_train , y_train )\n",
    "     \n",
    "    #num_boost_round为弱分类器数目，下面的代码参数里因为已经设置了early_stopping_rounds\n",
    "    #即性能未提升的次数超过过早停止设置的数值，则停止训练\n",
    "    cv_result = lgbm.cv(lgbm_params , lgbmtrain , num_boost_round=MAX_ROUNDS , nfold=3,  metrics='multi_logloss' , early_stopping_rounds=early_stopping_rounds,seed=3 )\n",
    "     \n",
    "    print('best n_estimators:' , len(cv_result['multi_logloss-mean']))\n",
    "    print('best cv score:' , cv_result['multi_logloss-mean'][-1])\n",
    "     \n",
    "    return len(cv_result['multi_logloss-mean'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "best n_estimators: 346\n",
      "best cv score: 0.47470277411045786\n"
     ]
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'num_leaves': 70,\n",
    "          'max_depth': 7,\n",
    "          'max_bin': 127, #2^6,原始特征为整数，很少超过100\n",
    "          'subsample': 0.8,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.4,\n",
    "         }\n",
    "\n",
    "n_estimators_1 = get_n_estimators(params , X_train , y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. num_leaves & max_depth=7\n",
    "num_leaves建议70-80，搜索区间50-80,值越大模型越复杂，越容易过拟合\n",
    "相应的扩大max_depth=7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 3 folds for each of 4 candidates, totalling 12 fits\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.\n",
      "[Parallel(n_jobs=4)]: Done   8 out of  12 | elapsed:  2.4min remaining:  1.2min\n",
      "[Parallel(n_jobs=4)]: Done  12 out of  12 | elapsed:  3.4min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=3, shuffle=True),\n",
       "             error_score=nan,\n",
       "             estimator=LGBMClassifier(bagging_freq=1, boosting_type='gbdt',\n",
       "                                      class_weight=None, colsample_bytree=0.7,\n",
       "                                      importance_type='split',\n",
       "                                      learning_rate=0.1, max_bin=127,\n",
       "                                      max_depth=7, min_child_samples=20,\n",
       "                                      min_child_weight=0.001,\n",
       "                                      min_split_gain=0.0, n_estimators=346,\n",
       "                                      n_jobs=4, num_class=9, num_leaves=31,\n",
       "                                      objective='multiclass', random_state=None,\n",
       "                                      reg_alpha=0.0, reg_lambda=0.0,\n",
       "                                      silent=False, subsample=0.7,\n",
       "                                      subsample_for_bin=200000,\n",
       "                                      subsample_freq=0),\n",
       "             iid='deprecated', n_jobs=4,\n",
       "             param_grid={'num_leaves': range(50, 90, 10)},\n",
       "             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,\n",
       "             scoring='neg_log_loss', verbose=5)"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'num_class':9, \n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'n_estimators':n_estimators_1,\n",
    "          'max_depth': 7,\n",
    "          'max_bin': 127, #2^6,原始特征为整数，很少超过100\n",
    "          'subsample': 0.7,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.7,\n",
    "         }\n",
    "lg = LGBMClassifier(silent=False,  **params)\n",
    "\n",
    "num_leaves_s = range(50,90,10) #50,60,70,80\n",
    "tuned_parameters = dict( num_leaves = num_leaves_s)\n",
    "\n",
    "grid_search = GridSearchCV(lg, n_jobs=4, param_grid=tuned_parameters, cv = kfold, scoring=\"neg_log_loss\", verbose=5, refit = False)\n",
    "grid_search.fit(X_train , y_train)\n",
    "#grid_search.best_estimator_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.4753007676647278\n",
      "{'num_leaves': 50}\n"
     ]
    }
   ],
   "source": [
    "# examine the best model\n",
    "print(-grid_search.best_score_)\n",
    "print(grid_search.best_params_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "ename": "KeyError",
     "evalue": "'mean_train_score'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-19-9bcdb0775b77>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      2\u001b[0m \u001b[0mtest_means\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrid_search\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcv_results_\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m'mean_test_score'\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      3\u001b[0m \u001b[0mtest_stds\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrid_search\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcv_results_\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m'std_test_score'\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0mtrain_means\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrid_search\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcv_results_\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m'mean_train_score'\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      5\u001b[0m \u001b[0mtrain_stds\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrid_search\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcv_results_\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m'std_train_score'\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      6\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mKeyError\u001b[0m: 'mean_train_score'"
     ]
    }
   ],
   "source": [
    "# plot CV误差曲线\n",
    "test_means = grid_search.cv_results_[ 'mean_test_score' ]\n",
    "test_stds = grid_search.cv_results_[ 'std_test_score' ]\n",
    "train_means = grid_search.cv_results_[ 'mean_train_score' ]\n",
    "train_stds = grid_search.cv_results_[ 'std_train_score' ]\n",
    "\n",
    "n_leafs = len(num_leaves_s)\n",
    "\n",
    "x_axis = num_leaves_s\n",
    "plt.plot(x_axis, -test_means)\n",
    "#plt.errorbar(x_axis, -test_means, yerr=test_stds,label = ' Test')\n",
    "#plt.errorbar(x_axis, -train_means, yerr=train_stds,label = ' Train')\n",
    "plt.xlabel( 'num_leaves' )\n",
    "plt.ylabel( 'Log Loss' )\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-0.47530077, -0.47648816, -0.47638364, -0.47624663])"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_means"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 性能抖动，取系统推荐值：70, 不必再细调"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. min_child_samples\n",
    "叶子节点的最小样本数目\n",
    "\n",
    "叶子节点数目：70，共9类，平均每类8个叶子节点\n",
    "每棵树的样本数目数目最少的类（稀有事件）的样本数目：200 * 2/3 * 0.7 = 100\n",
    "所以每个叶子节点约100/8 = 12个样本点\n",
    "\n",
    "搜索范围：10-50"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 3 folds for each of 4 candidates, totalling 12 fits\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.\n",
      "[Parallel(n_jobs=4)]: Done   8 out of  12 | elapsed:  2.0min remaining:   60.0s\n",
      "[Parallel(n_jobs=4)]: Done  12 out of  12 | elapsed:  2.9min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=3, shuffle=True),\n",
       "             error_score=nan,\n",
       "             estimator=LGBMClassifier(bagging_freq=1, boosting_type='gbdt',\n",
       "                                      class_weight=None, colsample_bytree=0.4,\n",
       "                                      importance_type='split',\n",
       "                                      learning_rate=0.1, max_bin=127,\n",
       "                                      max_depth=7, min_child_samples=20,\n",
       "                                      min_child_weight=0.001,\n",
       "                                      min_split_gain=0.0, n_estimators=346,\n",
       "                                      n_jobs=4, num_class=9, num_leaves=70,\n",
       "                                      objective='multiclass', random_state=None,\n",
       "                                      reg_alpha=0.0, reg_lambda=0.0,\n",
       "                                      silent=False, subsample=0.8,\n",
       "                                      subsample_for_bin=200000,\n",
       "                                      subsample_freq=0),\n",
       "             iid='deprecated', n_jobs=4,\n",
       "             param_grid={'min_child_samples': range(10, 50, 10)},\n",
       "             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,\n",
       "             scoring='neg_log_loss', verbose=5)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'num_class':9, \n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'n_estimators':n_estimators_1,\n",
    "          'max_depth': 7,\n",
    "          'num_leaves':70,\n",
    "          'max_bin': 127, #2^6,原始特征为整数，很少超过100\n",
    "          'subsample': 0.8,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.4,\n",
    "         }\n",
    "lg = LGBMClassifier(silent=False,  **params)\n",
    "\n",
    "min_child_samples_s = range(10,50,10) \n",
    "tuned_parameters = dict( min_child_samples = min_child_samples_s)\n",
    "\n",
    "grid_search = GridSearchCV(lg, n_jobs=4,  param_grid=tuned_parameters, cv = kfold, scoring=\"neg_log_loss\", verbose=5, refit = False)\n",
    "grid_search.fit(X_train , y_train)\n",
    "# grid_search.best_estimator_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.4735328925862344\n",
      "{'min_child_samples': 40}\n"
     ]
    }
   ],
   "source": [
    "# examine the best model\n",
    "print(-grid_search.best_score_)\n",
    "print(grid_search.best_params_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "ename": "KeyError",
     "evalue": "'mean_train_score'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-24-e33e26c4c3e6>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      2\u001b[0m \u001b[0mtest_means\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrid_search\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcv_results_\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m'mean_test_score'\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      3\u001b[0m \u001b[0mtest_stds\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrid_search\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcv_results_\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m'std_test_score'\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0mtrain_means\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrid_search\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcv_results_\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m'mean_train_score'\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      5\u001b[0m \u001b[0mtrain_stds\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrid_search\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcv_results_\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m'std_train_score'\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      6\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mKeyError\u001b[0m: 'mean_train_score'"
     ]
    }
   ],
   "source": [
    "# plot CV误差曲线\n",
    "test_means = grid_search.cv_results_[ 'mean_test_score' ]\n",
    "test_stds = grid_search.cv_results_[ 'std_test_score' ]\n",
    "train_means = grid_search.cv_results_[ 'mean_train_score' ]\n",
    "train_stds = grid_search.cv_results_[ 'std_train_score' ]\n",
    "\n",
    "x_axis = min_child_samples_s\n",
    "\n",
    "plt.plot(x_axis, -test_means)\n",
    "#plt.errorbar(x_axis, -test_scores, yerr=test_stds ,label = ' Test')\n",
    "#plt.errorbar(x_axis, -train_scores, yerr=train_stds,label =  +' Train')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-0.47663136, -0.47476954, -0.4739517 , -0.47353289])"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_means"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### min_child_samples=30"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 行采样参数 sub_samples/bagging_fraction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 3 folds for each of 5 candidates, totalling 15 fits\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.\n",
      "[Parallel(n_jobs=4)]: Done  12 out of  15 | elapsed:  2.6min remaining:   39.3s\n",
      "[Parallel(n_jobs=4)]: Done  15 out of  15 | elapsed:  3.2min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=3, shuffle=True),\n",
       "             error_score=nan,\n",
       "             estimator=LGBMClassifier(bagging_freq=1, boosting_type='gbdt',\n",
       "                                      class_weight=None, colsample_bytree=0.4,\n",
       "                                      importance_type='split',\n",
       "                                      learning_rate=0.1, max_bin=127,\n",
       "                                      max_depth=7, min_child_samples=40,\n",
       "                                      min_child_weight=0.001,\n",
       "                                      min_split_gain=0.0, n_estimators=346,\n",
       "                                      n_jobs=4, num_class=9, num_leaves=70,\n",
       "                                      objective='multiclass', random_state=None,\n",
       "                                      reg_alpha=0.0, reg_lambda=0.0,\n",
       "                                      silent=False, subsample=1.0,\n",
       "                                      subsample_for_bin=200000,\n",
       "                                      subsample_freq=0),\n",
       "             iid='deprecated', n_jobs=4,\n",
       "             param_grid={'subsample': [0.5, 0.6, 0.7, 0.8, 0.9]},\n",
       "             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,\n",
       "             scoring='neg_log_loss', verbose=5)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'num_class':9, \n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'n_estimators':n_estimators_1,\n",
    "          'max_depth': 7,\n",
    "          'num_leaves':70,\n",
    "          'min_child_samples':40,\n",
    "          'max_bin': 127, #2^6,原始特征为整数，很少超过100\n",
    "          #'subsample': 0.7,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.4,\n",
    "         }\n",
    "lg = LGBMClassifier(silent=False,  **params)\n",
    "\n",
    "subsample_s = [i/10.0 for i in range(5,10)]\n",
    "tuned_parameters = dict( subsample = subsample_s)\n",
    "\n",
    "grid_search = GridSearchCV(lg, n_jobs=4,  param_grid=tuned_parameters, cv = kfold, scoring=\"neg_log_loss\", verbose=5, refit = False)\n",
    "grid_search.fit(X_train , y_train)\n",
    "#grid_search.best_estimator_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.4735328925862344\n",
      "{'subsample': 0.8}\n"
     ]
    }
   ],
   "source": [
    "# examine the best model\n",
    "print(-grid_search.best_score_)\n",
    "print(grid_search.best_params_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "ename": "KeyError",
     "evalue": "'mean_train_score'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-28-af0af99cf3e9>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      2\u001b[0m \u001b[0mtest_means\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrid_search\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcv_results_\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m'mean_test_score'\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      3\u001b[0m \u001b[0mtest_stds\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrid_search\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcv_results_\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m'std_test_score'\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0mtrain_means\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrid_search\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcv_results_\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m'mean_train_score'\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      5\u001b[0m \u001b[0mtrain_stds\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrid_search\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcv_results_\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m'std_train_score'\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      6\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mKeyError\u001b[0m: 'mean_train_score'"
     ]
    }
   ],
   "source": [
    "# plot CV误差曲线\n",
    "test_means = grid_search.cv_results_[ 'mean_test_score' ]\n",
    "test_stds = grid_search.cv_results_[ 'std_test_score' ]\n",
    "train_means = grid_search.cv_results_[ 'mean_train_score' ]\n",
    "train_stds = grid_search.cv_results_[ 'std_train_score' ]\n",
    "\n",
    "x_axis = subsample_s\n",
    "\n",
    "plt.plot(x_axis, -test_means)\n",
    "#plt.errorbar(x_axis, -test_scores[:,i], yerr=test_stds[:,i] ,label = str(max_depths[i]) +' Test')\n",
    "#plt.errorbar(x_axis, -train_scores[:,i], yerr=train_stds[:,i] ,label = str(max_depths[i]) +' Train')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-0.47657295, -0.47410036, -0.47417503, -0.47353289, -0.47466273])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_means"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### subsample=0.8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 列采样参数 sub_feature/feature_fraction/colsample_bytree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 3 folds for each of 5 candidates, totalling 15 fits\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.\n",
      "[Parallel(n_jobs=4)]: Done  12 out of  15 | elapsed:  3.7min remaining:   55.0s\n",
      "[Parallel(n_jobs=4)]: Done  15 out of  15 | elapsed:  4.4min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=3, shuffle=True),\n",
       "             error_score=nan,\n",
       "             estimator=LGBMClassifier(bagging_freq=1, boosting_type='gbdt',\n",
       "                                      class_weight=None, colsample_bytree=1.0,\n",
       "                                      importance_type='split',\n",
       "                                      learning_rate=0.1, max_bin=127,\n",
       "                                      max_depth=7, min_child_samples=40,\n",
       "                                      min_child_weight=0.001,\n",
       "                                      min_split_gain=0.0, n_estimators=346,\n",
       "                                      n_jobs=4, num_class=9, num_leaves=70,\n",
       "                                      objective='multiclass', random_state=None,\n",
       "                                      reg_alpha=0.0, reg_lambda=0.0,\n",
       "                                      silent=False, subsample=0.8,\n",
       "                                      subsample_for_bin=200000,\n",
       "                                      subsample_freq=0),\n",
       "             iid='deprecated', n_jobs=4,\n",
       "             param_grid={'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9]},\n",
       "             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,\n",
       "             scoring='neg_log_loss', verbose=5)"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'num_class':9, \n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'n_estimators':n_estimators_1,\n",
    "          'max_depth': 7,\n",
    "          'num_leaves':70,\n",
    "          'min_child_samples':40,\n",
    "          'max_bin': 127, #2^6,原始特征为整数，很少超过100\n",
    "          'subsample': 0.8,\n",
    "          'bagging_freq': 1,\n",
    "          #'colsample_bytree': 0.7,\n",
    "         }\n",
    "lg = LGBMClassifier(silent=False,  **params)\n",
    "\n",
    "colsample_bytree_s = [i/10.0 for i in range(5,10)]\n",
    "tuned_parameters = dict( colsample_bytree = colsample_bytree_s)\n",
    "\n",
    "grid_search = GridSearchCV(lg, n_jobs=4,  param_grid=tuned_parameters, cv = kfold, scoring=\"neg_log_loss\", verbose=5, refit = False)\n",
    "grid_search.fit(X_train , y_train)\n",
    "#grid_search.best_estimator_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.4737232410844914\n",
      "{'colsample_bytree': 0.5}\n"
     ]
    }
   ],
   "source": [
    "# examine the best model\n",
    "print(-grid_search.best_score_)\n",
    "print(grid_search.best_params_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "ename": "KeyError",
     "evalue": "'mean_train_score'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-32-f3823637d2cb>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      2\u001b[0m \u001b[0mtest_means\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrid_search\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcv_results_\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m'mean_test_score'\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      3\u001b[0m \u001b[0mtest_stds\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrid_search\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcv_results_\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m'std_test_score'\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0mtrain_means\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrid_search\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcv_results_\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m'mean_train_score'\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      5\u001b[0m \u001b[0mtrain_stds\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrid_search\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcv_results_\u001b[0m\u001b[0;34m[\u001b[0m \u001b[0;34m'std_train_score'\u001b[0m \u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      6\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mKeyError\u001b[0m: 'mean_train_score'"
     ]
    }
   ],
   "source": [
    "# plot CV误差曲线\n",
    "test_means = grid_search.cv_results_[ 'mean_test_score' ]\n",
    "test_stds = grid_search.cv_results_[ 'std_test_score' ]\n",
    "train_means = grid_search.cv_results_[ 'mean_train_score' ]\n",
    "train_stds = grid_search.cv_results_[ 'std_train_score' ]\n",
    "\n",
    "x_axis = colsample_bytree_s\n",
    "\n",
    "plt.plot(x_axis, -test_means)\n",
    "#plt.errorbar(x_axis, -test_scores[:,i], yerr=test_stds[:,i] ,label = str(max_depths[i]) +' Test')\n",
    "#plt.errorbar(x_axis, -train_scores[:,i], yerr=train_stds[:,i] ,label = str(max_depths[i]) +' Train')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "再调小一点，由于特征包括原始特征+tfidf特征，是多了些"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 3 folds for each of 2 candidates, totalling 6 fits\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.\n",
      "[Parallel(n_jobs=4)]: Done   3 out of   6 | elapsed:   49.9s remaining:   49.9s\n",
      "[Parallel(n_jobs=4)]: Done   6 out of   6 | elapsed:  1.3min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=3, shuffle=True),\n",
       "             error_score=nan,\n",
       "             estimator=LGBMClassifier(bagging_freq=1, boosting_type='gbdt',\n",
       "                                      class_weight=None, colsample_bytree=1.0,\n",
       "                                      importance_type='split',\n",
       "                                      learning_rate=0.1, max_bin=127,\n",
       "                                      max_depth=7, min_child_samples=40,\n",
       "                                      min_child_weight=0.001,\n",
       "                                      min_split_gain=0.0, n_estimators=346,\n",
       "                                      n_jobs=4, num_class=9, num_leaves=70,\n",
       "                                      objective='multiclass', random_state=None,\n",
       "                                      reg_alpha=0.0, reg_lambda=0.0,\n",
       "                                      silent=False, subsample=0.8,\n",
       "                                      subsample_for_bin=200000,\n",
       "                                      subsample_freq=0),\n",
       "             iid='deprecated', n_jobs=4,\n",
       "             param_grid={'colsample_bytree': [0.3, 0.4]},\n",
       "             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,\n",
       "             scoring='neg_log_loss', verbose=5)"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'num_class':9, \n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.1,\n",
    "          'n_estimators':n_estimators_1,\n",
    "          'max_depth': 7,\n",
    "          'num_leaves':70,\n",
    "          'min_child_samples':40,\n",
    "          'max_bin': 127, #2^6,原始特征为整数，很少超过100\n",
    "          'subsample': 0.8,\n",
    "          'bagging_freq': 1,\n",
    "          #'colsample_bytree': 0.7,\n",
    "         }\n",
    "lg = LGBMClassifier(silent=False,  **params)\n",
    "\n",
    "colsample_bytree_s = [i/10.0 for i in range(3,5)]\n",
    "tuned_parameters = dict( colsample_bytree = colsample_bytree_s)\n",
    "\n",
    "grid_search = GridSearchCV(lg, n_jobs=4,  param_grid=tuned_parameters, cv = kfold, scoring=\"neg_log_loss\", verbose=5, refit = False)\n",
    "grid_search.fit(X_train , y_train)\n",
    "#grid_search.best_estimator_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.4735328925862344\n",
      "{'colsample_bytree': 0.4}\n"
     ]
    }
   ],
   "source": [
    "# examine the best model\n",
    "print(-grid_search.best_score_)\n",
    "print(grid_search.best_params_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### colsample_bytree=0.4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 正则化参数lambda_l1(reg_alpha), lambda_l2(reg_lambda)感觉不用调了"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 减小学习率，调整n_estimators"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "best n_estimators: 3763\n",
      "best cv score: 0.46560184352320455\n"
     ]
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'num_class':9, \n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.01,\n",
    "          #'n_estimators':n_estimators_1,\n",
    "          'max_depth': 7,\n",
    "          'num_leaves':70,\n",
    "          'min_child_samples':40,\n",
    "          'max_bin': 127, #2^6,原始特征为整数，很少超过100\n",
    "          'subsample': 0.8,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.4,\n",
    "         }\n",
    "n_estimators_2 = get_n_estimators(params , X_train , y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 用所有训练数据，采用最佳参数重新训练模型\n",
    "由于样本数目增多，模型复杂度稍微扩大一点？\n",
    "num_leaves增多5\n",
    "min_child_samples按样本比例增加到40"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LGBMClassifier(bagging_freq=1, boosting_type='gbdt', class_weight=None,\n",
       "               colsample_bytree=0.4, importance_type='split',\n",
       "               learning_rate=0.01, max_bin=127, max_depth=7,\n",
       "               min_child_samples=40, min_child_weight=0.001, min_split_gain=0.0,\n",
       "               n_estimators=3763, n_jobs=4, num_class=9, num_leaves=75,\n",
       "               objective='multiclass', random_state=None, reg_alpha=0.0,\n",
       "               reg_lambda=0.0, silent=False, subsample=0.8,\n",
       "               subsample_for_bin=200000, subsample_freq=0)"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt',\n",
    "          'objective': 'multiclass',\n",
    "          'num_class':9, \n",
    "          'n_jobs': 4,\n",
    "          'learning_rate': 0.01,\n",
    "          'n_estimators':n_estimators_2,\n",
    "          'max_depth': 7,\n",
    "          'num_leaves':75,\n",
    "          'min_child_samples':40,\n",
    "          'max_bin': 127, #2^6,原始特征为整数，很少超过100\n",
    "          'subsample': 0.8,\n",
    "          'bagging_freq': 1,\n",
    "          'colsample_bytree': 0.4,\n",
    "         }\n",
    "\n",
    "lg = LGBMClassifier(silent=False,  **params)\n",
    "lg.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 保存模型，用于后续测试"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "import _pickle as cPickle\n",
    "\n",
    "cPickle.dump(lg, open(\"Otto_LightGBM_org_tfidf.pkl\", 'wb'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 特征重要性"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\"columns\":list(feat_names), \"importance\":list(lg.feature_importances_.T)})\n",
    "df = df.sort_values(by=['importance'],ascending=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "plt.bar(range(len(lg.feature_importances_)), lg.feature_importances_)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "tfidf的特征重要性更高一些。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
